AI-generated code could imperil the software supply chain with ‘hallucinated’ dependencies—here’s why.

A new body of research shows that AI-generated software code can introduce significant, often overlooked risks into the modern software supply chain. By producing code with references to non-existent third-party libraries, language models can seed vulnerable dependencies that attackers could exploit through supply-chain methods. The findings come from a large-scale study that tested a wide range of language models against real-world programming tasks, revealing a troubling pattern of “package hallucinations” that could empower dependency-confusion attacks if left unchecked. The results underscore the need for developers, security teams, and platform maintainers to rethink how AI-assisted coding is integrated into workflows and to strengthen verification, dependency management, and build hygiene across ecosystems.

Understanding AI-generated code and the supply-chain risk

AI-assisted coding tools, powered by large language models, have become a core part of many development workflows. They can accelerate boilerplate work, suggest implementations, and even generate sizable chunks of complex logic. Yet when those tools generate code, they sometimes introduce references that don't exist in the real world: non-existent package dependencies that a project would attempt to fetch from the software ecosystem. This phenomenon, termed package hallucination in the research community, represents a new dimension of risk for the software supply chain: downstream systems could inadvertently pull in malicious code or counterfeit libraries if a developer trusts the AI output without rigorous verification.

In the study at the heart of these revelations, researchers subjected a broad spectrum of language models to extensive code generation tasks. They used 16 widely used large language models to produce 576,000 distinct code samples. The goal was not merely to test whether the generated code would compile or run but to evaluate the integrity of the package references embedded within it. The team analyzed millions of package references and found that a substantial share pointed to packages that do not exist in real repositories. This finding signals a latent vulnerability in AI-generated code: the potential to seed supply-chain vectors with counterfeit or malicious dependencies that could be exploited downstream.

As software development increasingly hinges on reusable components, libraries, and dependencies, the integrity of those dependencies becomes critical. Dependencies relieve developers from re-implementing common functionality and are an indispensable part of modern software engineering. When those dependencies are hallucinated or misrepresented, they can derail builds, introduce backdoors, or enable data exfiltration and other nefarious actions. The study’s measurements demonstrate that this is not an anecdotal problem but a pervasive pattern that demands attention as AI becomes more embedded in coding practices.

What “package hallucination” means and how it relates to dependency confusion

Package hallucination occurs when an AI system outputs references to libraries, packages, or components that do not exist in the target ecosystem, names the model has invented or misremembered rather than reliably recalled. In practice, a developer might see a generated code snippet that imports a package with a given name and version, only to discover that the package is non-existent or unrelated to the intended functionality. In the worst cases, an attacker will have registered a package under that imaginary name, and installing it brings in code that behaves unpredictably or embeds malicious payloads.
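
To make the failure mode concrete, here is a minimal sketch of what a hallucinated dependency might look like in generated Python. The package and function names are invented for this illustration; the danger is precisely that they look plausible enough to install without checking.

    # Hypothetical AI-generated snippet. The imported package name is invented
    # for illustration; installing it blindly would either fail or, worse,
    # fetch whatever an attacker has registered under that name.
    from acme_payload_guard import validate_payload  # hallucinated dependency

    def handle_event(event: dict) -> bool:
        # The call looks reasonable, which is why a developer might run
        # "pip install acme-payload-guard" without verifying the package exists.
        return validate_payload(event, schema="user_event")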

Concretely, these hallucinations feed into a broader attack class known as dependency confusion (also called package confusion). The basic idea is deceptively simple: an attacker publishes a counterfeit library on a public registry under the same name as a legitimate package (often one hosted on a private or internal index), typically with a higher version number, so that the dependency resolver pulls the attacker's copy instead of the authentic one. If a software project relies on a component from a public registry or a shared repository, and the dependency resolution process misinterprets the package's provenance or version, the malicious package can be installed during normal update or build steps. The result can be that trusted software ends up pulling in rogue code, enabling data theft, stealthy backdoors, or other harmful actions.
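
The sketch below uses hypothetical package names and a deliberately naive resolver to show why a "highest version wins" strategy across mixed indexes is enough to hand the win to an attacker's upload; real package managers apply more nuanced rules, but the underlying pressure is the same.

    # Simplified model of why naive "highest version wins" resolution enables
    # dependency confusion. Package names, versions, and index labels are
    # hypothetical.

    def parse_version(version: str) -> tuple:
        return tuple(int(part) for part in version.split("."))

    # Candidates a resolver might see for the same package name: the real
    # library on a private index, and an attacker's lookalike on the public one.
    candidates = [
        {"name": "acme-billing-utils", "version": "1.4.2", "source": "private-index"},
        {"name": "acme-billing-utils", "version": "99.0.0", "source": "public-registry"},  # attacker upload
    ]

    # Merging all indexes and preferring the highest version silently selects
    # the attacker's package.
    chosen = max(candidates, key=lambda c: parse_version(c["version"]))
    print(f"Installing {chosen['name']} {chosen['version']} from {chosen['source']}")
    # -> Installing acme-billing-utils 99.0.0 from public-registry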

The notion of dependency confusion became public knowledge through high-profile demonstrations in 2021, when a security researcher showed that counterfeit packages could be pulled into the networks of some of the world's largest technology companies. Those demonstrations highlighted a fundamental weakness in how supply chains trust public repositories and versioned artifacts. The risk is not purely hypothetical: it sits at the intersection of how developers discover and integrate external code, how packaging systems manage dependencies, and how AI-assisted tooling suggests and generates code with embedded import statements or package references.

In the current research, the researchers emphasize that the chain reaction from hallucinated dependencies can be particularly dangerous when a model suggests package names that appear plausible and credible to a developer. If a user trusts the AI’s output and proceeds to install or incorporate the suggested dependency without rigorous verification, any payload embedded in the malicious package could be executed on the user’s system. This chain—from AI-generated code to negligent verification to exploitation—highlights a new frontier in software risk that blends machine-generated content with traditional supply-chain exploits.

The study’s methodology and key findings

The research team designed a rigorous, multi-language evaluation to quantify how often AI-generated code references non-existent packages and how this might translate into real-world risk. They conducted a structured set of experiments across two commonly used programming languages: Python and JavaScript. In total, 576,000 code samples were produced across 30 distinct tests, 16 using Python prompts and 14 using JavaScript prompts, with each test generating 19,200 samples. This volume ensured that the analysis captured a broad and representative snapshot of how LLMs perform in coding tasks under varied prompts and scenarios.

Among the 2.23 million total package references embedded in those samples, 440,445 references pointed to packages that do not exist. This figure represents 19.7 percent of all references—nearly one in five. The scale is disconcerting, particularly because a substantial portion of the hallucinations—205,474, or roughly half of the non-existent packages—had unique package names. In other words, the study did not merely observe the same few hallucinated names repeatedly; there were many distinct phantom packages being suggested across the prompts and iterations.

A striking aspect of the results is the persistence of hallucinations. The researchers found that 43 percent of package hallucinations recurred in all 10 queries when the same prompt was repeated, and in 58 percent of cases a hallucinated package name was repeated more than once within 10 iterations. This persistence indicates that the hallucination phenomenon is not a random error but a repeatable pattern that endures across iterative prompts and sessions. For malicious actors seeking reliable exploitation vectors, this repeatability could significantly lower the effort required to identify non-existent but consistently suggested package names and to publish corresponding malware that targets those names.

The study also compared model families and language ecosystems to understand where hallucinations tend to be more prevalent. Open-source LLMs, such as CodeLlama and DeepSeek, produced an average hallucination rate approaching 22 percent, whereas commercial models performed better in this specific metric, with rates just above 5 percent. In terms of programming language, Python code generally yielded fewer hallucinations than JavaScript. Python saw an average hallucination rate near 16 percent, while JavaScript showed a rate slightly above 21 percent. These disparities prompt questions about how training data, model fine-tuning, and namespace complexity influence the tendency to hallucinate in dependency references.

The researchers offered cautious interpretations of these differences. They noted that large commercial models typically have far larger parameter counts, and possibly architectural differences, compared with open-source alternatives, which could contribute to the observed gap. Many experts suggest that commercial models tend to have richer training data, broader optimization objectives, and more extensive safety and alignment practices that can indirectly reduce certain errors, including hallucinations about dependencies. However, within the open-source cohort, model size alone did not show a clear, direct correlation with hallucination rate. The authors observed that other factors, such as training data quality, instruction tuning, and safety constraints, likely play substantial roles.

Beyond model size, the study highlights how the ecosystem structure influences risk. JavaScript’s package ecosystem is far larger and more complex than Python’s, with an estimated order of magnitude more packages and a more crowded namespace. This complexity contributes to greater uncertainty in recalling exact package names and makes hallucinations more prevalent in the JavaScript context. The researchers suggest that the sheer scale and dynamism of the JavaScript package landscape pose particular challenges for AI-generated code, increasing the likelihood that a model will propose non-existent or misnamed components in ways that are not easily caught by developers during review.

When discussing the broader implications, the researchers point to the untrustworthiness that currently underpins AI-generated outputs. They reference industry perspectives predicting that a large share of code will be AI-generated within a few years, which underscores the urgency of implementing robust safeguards. The study’s findings serve as a reminder that AI assistance, while powerful, does not inherently produce trustworthy code without careful verification and secure development practices.

How package hallucinations translate into real-world risks

The practical risk of package hallucinations centers on how developers build, update, and deploy software. When an AI-assisted tool suggests a non-existent package, a developer might proceed to install or depend on it in a project. On its own, that attempt simply fails; the danger arises when an attacker has already published a malicious package under the hallucinated name. In that case the project's build process resolves the dependency successfully, the phantom library is integrated into the codebase, and code with vulnerabilities or malicious intent runs unnoticed. The attacker's objective could be to gain unauthorized access, exfiltrate data, or install backdoors that persist across updates and downstream systems.

A broader risk arises when hallucinated dependencies are not a one-off error but a recurring pattern. Attackers could exploit this by compiling a catalog of non-existent packages that the model frequently proposes. They could then publish malicious packages under the same names, wagering that these names will appear repeatedly in AI-assisted development workflows. If a significant subset of developers uses AI-generated code without deeply validating every dependency, this approach could realize a scalable supply-chain attack vector that thrives on repetition and predictability.

The persistence of hallucinations is particularly concerning because it creates a predictable attack surface. If a package name keeps surfacing across many prompts and iterations, it becomes a "sweet spot" for attackers to target. A malware payload concealed within a counterfeit package could be distributed widely, and because the name is repeatedly suggested by the model, the probability that a developer will encounter and install it rises. In this sense, the synergy between AI-generated content and dependency confusion becomes a fertile ground for cybercriminals.

The findings also illuminate how AI-generated code interacts with standard software development workflows. Developers often rely on package managers that fetch dependencies from public or private registries. When AI-generated code references packages that do not exist, the risk is not merely that a build will fail; it is that the system could inadvertently fetch malicious artifacts under the guise of legitimate software supply chain components. As teams increasingly adopt automation, CI/CD pipelines, and AI-assisted tooling, the potential attack surface expands if verification and validation do not keep pace with tooling enhancements.

The language and model-disparity landscape revealed by the study

A key takeaway from the study is that the propensity for hallucinations is not uniform across languages, models, or ecosystems. The researchers found significant disparities between open-source and commercial models. Open-source models exhibited a higher rate of package hallucination, averaging around 22 percent, while commercial models tended to perform better in this metric, showing a little over 5 percent. This difference is likely influenced by several factors, including parameter counts, data diversity, and the breadth of training data used for commercial models. Commercial models often carry larger parameter counts and more extensive optimization and safety tuning, which may help reduce certain error patterns, including hallucinations of dependencies. However, this does not imply that commercial models are immune to package hallucination; it simply indicates a relative difference in observed frequency under the study’s conditions.

In the Python ecosystem, code generated by models tended to exhibit fewer hallucinations compared to JavaScript. The Python-based code had an average hallucination rate of roughly 16 percent, whereas JavaScript code hovered around 21 percent. Several hypotheses could explain this discrepancy. Python’s package ecosystem is sizable but perhaps more homogeneous in naming conventions, with a somewhat more centralized pattern in dependency packaging. JavaScript, by contrast, has a vastly larger and more compartmentalized ecosystem, with a broader array of namespace conventions, sub-packaging patterns, and a faster release cadence. The increased surface area and naming complexity in JavaScript could contribute to higher rates of hallucination as models attempt to recall and align with a large and evolving set of package names and versions.

The study’s authors also discuss the possible role of data curation, fine-tuning, and safety alignment in shaping hallucination rates. They note that model training pipelines—particularly instruction-following training and safety-oriented refinements—could have unintended consequences on how models recall and propose dependencies. The interplay between model capabilities and instruction handling is complex: instruction tuning might improve a model’s ability to follow prompts but may also constrain its recall in ways that affect how accurately it retrieves library names and versions. The researchers emphasize that more work is needed to disentangle these effects and to identify mitigation strategies that can preserve model usefulness while reducing risky hallucinations.

Beyond model size, the training corpus and the specificity of the task appear to influence the outcome. The authors remark that among open-source models, there was no simple correlation between having more parameters and lower hallucination rates. Instead, the quality and diversity of the training data, as well as the alignment strategies used to ensure safe and helpful outputs, matter more. These insights suggest that improvements in AI-assisted coding must be coupled with rigorous data curation, robust tooling for dependency verification, and education for developers about the limitations and risks of AI-generated code.

Implications for developers, security teams, and software supply chains

The emergence of package hallucinations as a credible threat vector has several practical implications for software development and security practices. First, there is a clear call to strengthen verification mechanisms around AI-generated code. Developers should implement multi-layer checks: static analysis that can flag references to non-existent or suspicious dependencies, automated verification of package metadata against authoritative registries, and reproducible builds that ensure the exact same dependencies are fetched in every environment. The goal is to catch hallucinated packages before they can be resolved and installed in a live environment.
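
As one concrete layer, a sketch of an automated existence check is shown below. It assumes a plain requirements.txt and the public PyPI JSON API; the registry URL and the parsing would differ for private indexes or other ecosystems. Note that existence alone is not trustworthiness: a squatted package would pass this check, so it complements rather than replaces provenance review.

    import re
    import urllib.error
    import urllib.request

    REGISTRY_URL = "https://pypi.org/pypi/{name}/json"  # public PyPI JSON API

    def package_exists(name: str) -> bool:
        """Return True if the registry knows the package, False on a 404."""
        try:
            with urllib.request.urlopen(REGISTRY_URL.format(name=name), timeout=10):
                return True
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return False
            raise  # rate limits or outages need separate handling

    def find_phantom_dependencies(path: str = "requirements.txt") -> list[str]:
        missing = []
        with open(path) as fh:
            for raw in fh:
                line = raw.split("#")[0].strip()
                if not line or line.startswith("-"):
                    continue  # skip blanks, comments, and pip options
                # Keep only the bare project name, dropping extras and pins.
                name = re.split(r"[\[=<>!~;@ ]", line, maxsplit=1)[0]
                if name and not package_exists(name):
                    missing.append(name)
        return missing

    if __name__ == "__main__":
        phantoms = find_phantom_dependencies()
        if phantoms:
            raise SystemExit(f"Unknown packages (possible hallucinations): {phantoms}")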

Second, dependency management practices must adapt to the reality that AI-generated content can introduce non-existent components. Teams may consider instituting stricter review rails for any dependency added or suggested by an AI tool. This could include mandatory cross-checks with official package registries, version pinning, and explicit approval workflows for dependencies that appear in generated code but are not present in the project’s existing dependency graph. Such safeguards would slow the potential exploitation window and give security teams a chance to intervene before a malicious package is introduced into the supply chain.
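
A minimal sketch of such an approval gate follows, assuming the team maintains an allowlist file of already-approved dependencies (the file name and format here are illustrative); anything an AI tool proposes outside that list would require explicit human sign-off before merging.

    import json

    def load_allowlist(path: str = "approved-dependencies.json") -> set[str]:
        # Assumed format: a JSON array of approved package names.
        with open(path) as fh:
            return {name.lower() for name in json.load(fh)}

    def unapproved(proposed: list[str], allowlist: set[str]) -> list[str]:
        """Return proposed dependencies that have not been approved yet."""
        return [name for name in proposed if name.lower() not in allowlist]

    if __name__ == "__main__":
        # In practice, "proposed" would be parsed from the diff of a manifest
        # file in the pull request; it is hard-coded here for illustration.
        proposed = ["requests", "acme-billing-utils"]
        blocked = unapproved(proposed, load_allowlist())
        if blocked:
            raise SystemExit(f"Dependencies requiring manual approval: {blocked}")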

Third, build and deployment pipelines require heightened scrutiny. Continuous integration and deployment workflows should incorporate dependency auditing and integrity checks. For example, pipelines could require a software bill of materials (SBOM) that enumerates all dependencies and validates that each dependency corresponds to a real, verifiable artifact. When AI-generated code introduces a new dependency, the SBOM could trigger an automated verification step that confirms the existence and legitimacy of the package and its publisher.
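
The following sketch illustrates one way such a verification step could work, assuming a CycloneDX-style JSON SBOM whose Python components carry pkg:pypi package URLs; the file name and the decision to check only PyPI-hosted components are assumptions for this example.

    import json
    import urllib.error
    import urllib.request

    RELEASE_URL = "https://pypi.org/pypi/{name}/{version}/json"  # 404 if the release is unknown

    def release_exists(name: str, version: str) -> bool:
        try:
            with urllib.request.urlopen(RELEASE_URL.format(name=name, version=version), timeout=10):
                return True
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return False
            raise

    def audit_sbom(path: str = "sbom.json") -> list[str]:
        with open(path) as fh:
            sbom = json.load(fh)
        unverified = []
        for component in sbom.get("components", []):
            if not component.get("purl", "").startswith("pkg:pypi/"):
                continue  # this sketch only checks PyPI-hosted components
            name, version = component["name"], component.get("version", "")
            if version and not release_exists(name, version):
                unverified.append(f"{name}=={version}")
        return unverified

    if __name__ == "__main__":
        problems = audit_sbom()
        if problems:
            raise SystemExit(f"SBOM entries with no matching registry release: {problems}")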

Fourth, there is a need for safeguards in AI tooling itself. AI code assistants could be equipped with built-in validation features that automatically verify any suggested dependencies in real time. Such features could cross-reference package names and metadata with trusted registries, alerting the developer to potential hallucinations or suggesting safe alternatives. Additionally, models could be fine-tuned with a focus on reducing dependency hallucination, for instance by biasing the model toward verified package catalogs and by incorporating safeguards that avoid proposing non-existent names.
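
A sketch of such a guard is shown below: it parses model-generated Python with the standard ast module and flags imports that are neither in the standard library nor on a locally maintained vetted list. The vetted set and the sample snippet are assumptions for illustration, and import names do not always match distribution names (for example, import yaml comes from the PyYAML distribution), so a production version would need a mapping between the two.

    import ast
    import sys

    # Assumption: a team-maintained set of import names whose backing
    # distributions have already been reviewed.
    VETTED_IMPORTS = {"requests", "numpy", "yaml"}

    def suspicious_imports(generated_code: str) -> set[str]:
        """Return top-level imports that are neither stdlib nor vetted."""
        tree = ast.parse(generated_code)
        imported = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imported.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imported.add(node.module.split(".")[0])
        stdlib = set(sys.stdlib_module_names)  # available in Python 3.10+
        return imported - stdlib - VETTED_IMPORTS

    if __name__ == "__main__":
        snippet = "import os\nimport acme_payload_guard\n"  # hypothetical model output
        flagged = suspicious_imports(snippet)
        if flagged:
            print(f"Verify before installing: {sorted(flagged)}")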

Fifth, education and awareness remain critical. Developers should be trained to recognize the signs of dependency confusion risk and to implement best practices for secure coding with AI assistance. This includes understanding the difference between a plausible-sounding package name and a verified, existing artifact, and treating unfamiliar, AI-suggested dependencies with skepticism by default. Teams should cultivate a culture of meticulous verification, especially in high-stakes or production-critical systems where the consequences of supplying malicious code could be severe.

Finally, the broader software ecosystem—package registries, language communities, and tooling ecosystems—must collaborate to improve resilience against dependency-based attacks. Registry operators can implement stricter publishing protocols, ensure robust provenance metadata, and support integrity checks that help downstream tools verify artifact authenticity. Language-specific ecosystems can standardize safer dependency resolution strategies and provide clearer signals about package trustworthiness. Security researchers and platform maintainers should share findings and develop best practices that help developers navigate the evolving landscape where AI-assisted coding intersects with supply-chain security.

Language ecosystems, model types, and practical takeaways for teams

The study’s comparative findings underscore that there is no one-size-fits-all solution. Teams using AI-assisted coding should tailor their security and development practices to the specific languages and model types in their stack. For Python-centric projects, the lower observed hallucination rate relative to JavaScript is encouraging, but it does not eliminate risk. Even a 16 percent hallucination rate is too high for critical systems. For JavaScript-heavy projects, the risk is more pronounced due to the ecosystem’s sheer size and complexity. This means that JavaScript projects may require more stringent checks and more robust dependency governance when leveraging AI-generated code.

Open-source models, while appealing for transparency and control, demonstrated higher hallucination rates in this study. Organizations relying on open-source AI tooling should be particularly mindful of this risk and implement compensating controls. Conversely, commercial models, while not perfect, showed a lower incidence of hallucinated dependencies in the study. This does not imply that they are immune to risk, but it suggests that the choice of model type can influence dependency integrity. Regardless of the model type, a consistent, auditable approach to dependency verification is indispensable.

From a workflow perspective, integrating AI-generated code into a secure development lifecycle requires a layered strategy. Start with plan-and-check practices where teams outline the intended functionality and dependencies before generating code. Use code-generation prompts that steer models toward using well-known, widely maintained libraries with clear provenance. After generation, route code through a dependency hygiene process that flags any references to non-existent or untrusted packages. Maintain a robust SBOM and enforce policy-based approvals for any new dependencies introduced by AI-assisted workflows.

Despite the gains in speed and productivity that AI-assisted coding can deliver, security cannot be sacrificed for convenience. The balance between innovation and risk must be managed through people, processes, and technology working in concert. Teams should continually assess the evolving threat landscape—especially as AI tooling becomes more pervasive across development environments—and adapt their practices to preserve the integrity of the software supply chain.

Industry context, expert views, and a forward-looking stance

Experts and industry observers have been sounding alarms about AI-generated code and the potential security implications for a while. The research discussed here adds a data-driven basis to concerns about how language models may inadvertently propagate non-existent dependencies into real-world projects. In the broader context, many technologists anticipate substantial automation in code production, with a widely cited forecast suggesting that a majority of new code could be AI-generated within a few years. If accurate, this trajectory could magnify both the productivity benefits and the risk surface associated with AI-assisted development.

Security researchers emphasize that this is not solely a technical problem; it intersects with organizational practices, developer education, and vendor strategies. The convergence of AI-generated content with modern supply chains requires a multi-stakeholder approach: developers must practice rigorous validation, security teams must implement comprehensive auditing, registries must enforce stronger provenance controls, and AI-tool designers must embed safety features that reduce the likelihood of hallucinations and related issues.

The study’s findings also reflect a broader lesson about AI reliability and trust. While AI holds enormous potential to streamline repetitive coding tasks and accelerate software delivery, it does not automatically deliver correctness or security. The responsible use of AI in software development demands continuous validation, transparent risk assessment, and proactive mitigations that align with the goals of dependable, secure, and trustworthy software ecosystems.

Conclusion

The revelations about AI-generated code and the prevalence of package hallucinations illuminate a critical vulnerability in the modern software supply chain. As large language models become more embedded in coding workflows, the risk of non-existent or malicious dependencies entering production increases unless robust safeguards are adopted. Dependency confusion and package hallucination are real, measurable phenomena with concrete consequences for developers, security teams, and organizations that rely on open-source ecosystems and modern packaging practices.

The study’s comprehensive approach, spanning two languages, 16 models, and hundreds of thousands of generated samples, provides a clear call to action. Developers must employ rigorous dependency verification, maintain accurate software bills of materials, and enforce safe practices when incorporating AI-suggested code. Security teams should design and implement end-to-end validation pipelines that detect hallucinated dependencies before they can impact builds and deployments. Open-source and commercial model providers alike need to invest in training data quality, alignment strategies, and tooling that help reduce hallucinations while preserving the usefulness of AI-assisted coding.

Ultimately, the path forward rests on a combined effort: strengthening the integrity of the software supply chain, enhancing the reliability of AI-generated outputs, and cultivating a culture of meticulous verification in every step of the development process. By adopting layered safeguards, embracing rigorous verification, and prioritizing transparency and provenance, the industry can harness the benefits of AI-assisted coding while mitigating the risks posed by package hallucinations and dependency-confusion attack vectors.
